On some difNculties with a posterior probability 
approximation technique 
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Abstract. In Scott (20021 and Congdon (20061, a new method is advanced to 



compute posterior probabilities of models under consideration. It is based solely 
on MCMC outputs restricted to single models, i.e., it is bypassing reversible jump 
and other model exploration techniques. While it is indeed possible to approxi- 
mate posterior probabilities based solely on MCMC outputs from single models. 



Eis demonstrated by Gelfand and Dey ( 1994 1 and Bartolucci et al. ( 2006 1, we show 



that the proposals of Scott ( 2002 \ and Congdon ( 2006 1 are biased and advance 



several arguments towards this thesis, the primary one being the confusion be- 
tween model-based posteriors and joint pseudo-posteriors. From a practical point 
of view, the bias in Scott's (2002) approximation appears to be much more severe 
than the one in Congdon's (2006), the latter being often of the same magnitude 
as the posterior probability it approximates, although we also exhibit an example 
where the divergence from the true posterior probability is extreme. 

Keywords: Bayesian model choice, posterior approximation, reversible jump, Markov 
Chain Monte Carlo (MCMC), pseudo-priors, unbiasedness, improperty. 



1 Introduction 

Model selection is a fundamental statistical issue and a clear asset of the Bayesian 
methodology but it faces severe computational difficulties because of the requirement 
to explore simultaneously the parameter spaces of all models under comparison accu- 
rately enough to provide sufficient approximations to the posterior probabilities of all 
models. When 'Green ( 1995 1 introduced reversible jump techniques, it was perceived 



by the community as the second MCMC revolution in that it allowed for a valid and 
efficient exploration of the collection of models and the subsequent literature on the 
topic exploiting reversible jump MCMC is a testimony to the appeal of this method. 
Nonetheless, the implementation of reversible jump techniques in complex situations 
may face difficulties or at least inefficiencies of its own and, despite some recent ad- 



vances in the devising of the jumps underlying reversible jump MCMC ( Brooks et al 



2003 1 , the care required in the construction of those jumps often acts as a deterrent 



from its applications. 

There are practical alternatives to reversible jump MCMC when the number of 
models under consideration is small enough to allow for a complete exploration of those 
models. Integral approximations using importance sampling techniques like those found 



© 2008 International Society for Bayesian Analysis 



baOOOl 



2 



Difficulties with an approximation technique 



in'Gelfand and Dey (19941, based on a harmonic mean representation of the marginal 
densities, and in Gelman and Meng ( [l998) , focussing on the optimised selection of 



the importance function, are advocated as potential solutions, see Chen et al. (20001 
for a detailed introduction. The reassessment of those methods by Bartolucci et al. 



( 2006 1 showed the connection between a virtual reversible jump MCMC and importance 



sampling (see also Chopin and Robert, 20071. In particular, those papers demonstrated 
that the output of MCMC samplers on each single model could be used to produce 
approximations of posterior probabilities of those models, via some importance sampling 



methodologies also related to Newton and Raftery (1994) 



In Scott ( 2002 1 and Congdon ( 2006 ) , a new and straightforward method is advanced 
to compute posterior probabilities of models under scrutiny based solely on MCMC 
outputs restricted to single models. While this simplicity is quite appealing for the 



approximation of those probabilities, we believe that both proposals of Scott ( 2002 ) and 



Congdon ( 2006 1 are inherently biased and we advance in this note several arguments 
towards this thesis. In addition, we notice that, to overcome the bias we thus exhibited, 
a valid solution would call for the joint simulation of parameters under all models (using 
priors or pseudo-priors) and, in this step, the primary appeal of the methods would thus 



( 2002 1 and Congdon ( 2006 1 are inspired 



be lost compared to the one proposed by Carlin and Chib ( 1995 ), from which both ,Scott| 



We want to point out at this stage that the original purpose of Scott ( 2002 ) is to 



provide a survey of Bayesian methods for the analysis of hidden Markov models and 
thus that the approximation we analyse here is introduced as a side remark within the 
whole paper. If we insist here on the bias produced by Scott's (2002) approximation, it 



is because it generated followers, including Congdon (20061, and because both approx- 



imations are based on the same erroneous interpretation of the marginal distribution 
in Bayesian model choice. We also note that Congdon's (2006) approximation often 
produces values that are numerically of the same magnitude as the true value of the 
posterior probabilities, with sometimes very close proximity as illustrated in Example 
2 of Section 3^ but also potential severe mishaps as in Example |4] of Section 3^ 



2 The methods 



In a Bayesian framework of model comparison (see, e.g., 

in competition, DJlk, with densities fk{y\Ok)i a-nd prior probabilities g^. — P{M 
[k = 1, . . . ,D), the posterior probabilities of the models 971^ conditional on the data y 
are given by 



Robert 1 12001|) , given D models 

k) 



P{M = k\y) « Qk / fk{y\ekW{ek) dOk 



the proportionality term being given by the sum of the above and M denoting the 
unknown model index. 



In the specific setup of hidden Markov models, the solution of Scott (2002, Section 
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4.1) is to generate, simultaneously and independently, D MCMC chains 

l<k<D, 

with stationary distributions TTkiOklv) and to approximate P{M — k\y) by 



t=i 



as reported in formula (21) of Scott (20021, with the indication that (21) averages the 



D likelihoods corresponding to each 9j over the life of the Gibbs sampler (p. 347), the 
latter being understood as independently sampled D parallel Gibbs samplers (p. 347). 



Adopting a more general perspective, the proposal of Congdon (20061 for an approx- 
imation of the P{M = fc|y)'s follows both from Scott's (2002) approximation and from 
the pseudo- prior construction of Carlin and Chib ( |1995 1 that predated reversible jump 
MCMC by saturating the parameter space with an artificial simulation of all parameters 
at each iteration. However, due to a very special (and, we believe, mistaken) choice of 
pseudo-priors discussed below, Congdon's (2006, p. 349) approximation of P{M — k\y) 
eventually reduces to the estimator 



Qkiy) oc £)fcE S fkiy\&, 



I D 

E 



where the ^^*^'s are samples from TTk{Ok\y) (or approximate samples obtained by an 
MCMC algorithm). This is a simple and readily implementable formula that attracted 

(|2Ci08|. 



other researchers like Chen et al 



Although both approximations gk{y) and Qkiy) clearly differ in their expressions, by 
the addition of a T^ki^'t^) term in Congdon's (2006) formula, they fundamentally relate to 
the same notion that parameters from other models can be ignored when conditioning on 
the model index M . This approach is therefore bypassing the simultaneous exploration 
of several parameter spaces and it restricts the simulation effort to marginal samplers on 
each separate model. This feature is very appealing since it cuts most of the complexity 



from the schemes both of Carlin and Chib (19951 and of Green (19951. We however 



question the foundations of those approximations as presented in both Scott ( 2002 1 and 



Congdon ( 2006 1 and advance below arguments that both authors are using incompatible 



versions of joint distributions on the collection of parameters that jeopardise the validity 
of the approximations. 



3 Difficulties 



The sections below expose the difficulties found with both methods, following the ar- 
guments advanced inlScottl (|2002l) and ICongdonl (120061) , respectively. The fundamental 



4 



Difficulties with an approximation technique 



difficulty with both approaches appears to us to stem from a confusion between the 
model dependent simulations and the joint simulations based on a pseudo-prior scheme 
as in Carlin and Chib (19951. Once this difficulty is resolved, it appears that the corre- 

k\y) by P(M 



sponding approximation of P{M = k\y) by P{M = k\y) does require a joint simulation 
of all parameters and thus that the solutions proposed in Scott (|2002) and Congdon| 
(20061 are of the same complexity as the proposal oflCarlin and Chib (19951. If single 



models MCMC chains are to be used, alternative approaches described for instance in 



Chen et al. ( 2000 1 and compared in Gamerman and Lopes, (^OOGj) can be implemented. 



3.1 Incorrect marginals 

We denote hy 9 = {9i, . . . ,6d) the collection of parameters for all models under consid- 



eration. Both Scott (2002) and Congdon (20061 start from the representation 



P{M = k\y) = j P{M = k\y, e)T:{e\y) dO 
to justify the approximation 

T 

p{M = k\y) = ^(^^ = ^\y^ ^^*^)/^ ■ 

t=l 

This is indeed an unbiased estimator of P(M = k\y) provided the 6'*^*)'s are generated 
from the correct (marginal) posterior 



A,-l 
D 



(1) 



fe=i 

D 



(2) 



fc=i 



In both papers, the6lW's are instead simulated as independent outputs from the compo- 
nentwise posteriors T^ki^kly) and this divergence jeopardises the theoretical validity of 
the approximation. The error in both interpretations stems from the fact that, while the 
Ok^'s are (correctly) independent given the model index M, this independence does not 
hold once M is integrated out, which is the case for the 0k^'s in the above approximation 
P{M = k\y). 



3.2 MCMC versus marginal MCMC 



When Congdon (2006 1 defines a Markov chain (6''^*^) at the top of page 349, he indicates 
that the components of 9^*^ are made of independent Markov chains (0^*') simulated 
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with MCMC sa mplers relate d to the respective marginal posteriors TTk{dk\y), following 



the approach of Scott (2002). The aggregated chain (9^^^) is thus stationary against the 



product of those marginals, 



D 



k=l 



However, in the derivation of Carlin and Chib ( 1995 1, the model is defined in terms 
of ([l]) and the Markov chain should thus be constructed ag ainst not against the 
product of the model marginals. Obviously, in the case of |Congdon (2006), the fact 



that the pseudo-joint distribution does not exist because of the flat prior assumption 



(see Section 3.3 for a proof) prevents this construction but, in the case the flat prior 
is replaced with a proper (pseudo-) prior, the same statement holds: the probabilistic 
derivation of P{M = k\y) relies on the pseudo-prior construction and, to be valid, it does 
require the completion step at the core of Carlin and Chib (1995), where parameters 



need to be simulated from the pseudo-priors. Generating from the component-wise 
posteriors 7Tk{dk\y) produces a bias. 



Similarly, in |Scott| ( |2002| , the target of the Markov chain {9^*\M^*'>) should be the 
distribution 

p{9,M = k\y) (X TTk{9k) Qk h{y\ek) n ""A^'j) 

and the ^j*''s should thus be generated from the prior nj(9j) when Af*^*^ ^ j — or equiva- 
lently from the corresponding marginal if one does not condition on M^*^ but simulating 
a Markov chain with stationary distribution ^ is certainly a challenge in many settings 
if the latent variable decomposing the sum is not to be used. 



Since, in both 



Scott 



(2002) and Congdon (2006), the (6'(*))'s st 
the correct target, the resulting averages of P{M = k\y, 6'^*^), Qk{y) and gk{y), will both 
be biased, as demonstrated in the examples of Section [3^ 



3.3 Improperty of the posterior 



When resorting to the construction of pseudo-posteriors adopted by |Carlin and Chib| 
(1995), Congdon] (2006) uses a Hat prior as pseudo-prior on the parameters that are 
not in model OJlfc. More precisely, the joint prior distribution on (9,M) is given by 
Congdon's (2006) formula (2), 

P{9,M= k) = nk{Ok) Qk n = ^) 

= T^k{9k) Qk , 



which is indeed equivalent to assuming a flat prior as pseudo-prior on the parameters 
9j that are not in model 2Jlfc. 

Unfortunately, this simplifying assumption has a dramatic consequence in that the 
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corresponding joint posterior distribution of 9 is never defined (as a probability distri- 
bution) since 

D 

TT{e\y) ^Y.^'kiOkly) PiM ^ k\y) 
fc=i 

does not integrate to a finite value in any of the 9k^s (unless their support is compact). 
While Congdon (20061 states that it is not essential that the priors for P{6j^k\M — k) 
are improper (p. 348), the truth is that they cannot be improper. 

The fact that the posterior distribution on the saturated vector 9 — {9i, . . . ,9d) 
does not exist obviously has negative consequences on the subsequent derivations, since 
a positive recurrent Markov chain with stationary distribution 7r(0|y) cannot be con- 
structed. Similarly, the fact that 



P(M = k\y) = J P{9,M = k\Y) 



d9 



does not hold any longer. 



Note that |Scott| ( |2002[ ) does not follow the same track: when defining the pseudo- 
priors in his formula (20), he uses the product definitiorj^ 



P{9, M ^ k) 



3^k 



which means that the true priors could also be used as pseudo-priors across all models. 



However, we stress that Scott (2002| does not refer to the construction of Carlin and| 
Chib ( 1995 ) in his proposal, nor does he use pseudo-priors in his simulations. 



3.4 Illustrations 

We now proceed through several toy examples where all posterior quantities can be 
computed in order to evaluate the bias induced by both approximations and we observe 
that, despite its theoretical bias, Congdon's (2006) can sometimes achieve a close ap- 
proximation of the posterior probability, but also that, in other settings, it may produce 
an unreliable evaluation. 

Example 1. Consider the case when a model 3Hi : y\9 ~ U{0,9) with a prior 9 ~ 
8xp{l) is opposed to a model : y\9 ^ £xp{9) with a prior 9 ~ £xp{l). We also 
assume equal prior weights on both models: gi = g2 = 0.5. 

The marginals are then 

mi(2/)= r 9-^e-U9^E,{y), 



•^The indices on the priors have been added to make notations consistent with the present paper. 
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where Ei denotes the exponential integral function tabulated both in Mathematica and 
in the GSL library, and 

For instance, when y = 0.2, the posterior probability of 9Jli is thus equal to 

P{M = l\y)^ mi{y)/{mi{y) + m2{y)} 

= Ei(y)/{Ei(y) + (l + y)-2} 
« 0.6378, 

while, for y — 0.9, it is approximately 0.4843. This means that, in the former case, 
the Bayes factor of VJli against 9Jl2 is -B12 ~ 1.760, while for the latter, it decreases to 
B12 w 0.939. 

The posterior on 9 in model 3Jt2 is a gamma Ga(2, 1 + y) distribution and it can thus 
be simulated directly. For model 9Jli, the posterior is proportional to exp{—6) for 
9 larger than y and it can be simulated using a standard accept-reject algorithm based 
on an exponential £xp(l) proposal translated by y. 

Using simulations from the true (marginal) posteriors and the approximation of 
Congdon (2006), the numerical value of gi{y) based on 10^ simulations is 0.7919 when 



y = 0.2 and 0.5633 when y = 0.9, which translates into Bayes factors of 3.805 and of 



1.288, respectively. For the approximation ofiScott (20021, the numerical value of gi{y) 



is 0.6554 (corresponding to a Bayes factor of 1.898) when y = 0.2 and 0.6789 when 
y = 0.9 (corresponding to a Bayes factor of 2.11), based on the same simulations. Note 
that in the case y = 0.9, a selection based on either approximation of the Bayes factor 
would select the wrong model. 

If we use instead a correct simulation from the joint posterior ([2]), which can be 
achieved by using a Gibbs scheme with target distribution P{9, M = k\y), we then get 
a proper MCMC approximation to the posterior probabilities by the P{M = fc|j/)'s. For 
instance, based on 10^ simulations, the numerical value of P{M = l\y) when y — 0.2 
is 0.6370, while, for y = 0.9, it is 0.4843. Note that, due to the impropriety difficulty 



exposed in Section 3.3 the equivalent correction for Congdon's (2006) scheme cannot 
be implemented. 

In Figure[T| the three approximations are compared to the exact value of P{M = l|y) 
for a range of values of y. The correct simulation produces a graph that is indistinguish- 
able from the true probability, while Congdon's (2006) approximation stays within a 
reasonable range of the true value and Scott's (2002) approximation drifts apart for 
most values of y. 



The above correspondence of what is essentially Carlin and Chib's (1995) scheme 
with the true numerical value of the posterior probability is obviously unsurprising in 
this toy example but more advanced setups see the approximation degenerate, since 
the simulations from the prior are most often inefficient, especially when the number 
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Figure 1: Example [ij Comparison of three approximations of P{M = l\y) with the 
true value (in blue and full lines): Scott's (2002) approximation (in green and mixed 
dashes), Congdon's (2006) approximation (in brown and dashes), while the correction of 
Scott's (2002) approximation is indistinguishable from the true value (based on = 10^ 
simulations) . 



of models under comparison is large. This is the reason why Carlin and Chib (19951 
introduced pseudo-priors that were closer approximations to the true posteriors. 

The proximity of Congdon's (2006) approximation with the true value in Figure [T] 
shows that the method could possibly be used as a cheap first-order substitute of the 
true posterior probability if the bias was better assessed. First, we note that when all 
the componentwise posteriors are close to Dirac point masses at values 6k, Congdon's 
(2006) approximation is close to the true value 

/ ^ 



Further, the posterior expectation of fk{y\dk )''^k{dk ) involves the integral of 

fk{y\ek?T^k{ek? 



I 



rrikiy) 



thus the bias is likely to be small in settings where the product /fc(2/|^fe '')7r/c(0[*^) is 
peaked as in large samples, for instance. That the bias can almost completely disappear 
is exposed through a second toy example. 



Robert, CP., and Marin, J.-M. 9 

Example 2. Consider the case when a normal model VJli : y ~ J^{6, 1) with a prior 
9 ^ A/'(0, 1) is opposed to a normal model DJI2 : y ~ A/'(6', 1) with a prior ~ A/'(5, 1). 
We again assume equal prior weights. 

In that case, the marginals are available in closed form 

1 1 (y — 5)^ 

"ti(y) = ^= exp-— and TO2(y) = ^= exp 

V47r 4 V47r 4 



and the posterior probability of model 2Jli is 

F(Af = l|j/)=(l + exp^fc^ 



For argumentation's sake, assume that we now produce both sequences {9i^) and (6*2*'') 
from the posterior distributions N{y/2,\/2) and N{{y + 5)/2,l/2), respectively, by 
using the same sequence of et ^ A/'(0, 1), i.e. 

Using those sequences, we then obtain that 



exp-i(y - - - 5)2 exp-i(y - _ _i^^^)2 _ + _ 5)2 

exp-ld-^.Q^-ld + ^.Q^ 

5, 

= exp--(2y-5), 

independently of et, and thus that Congdon's (2006) approximation is truly exact using 
this device! Figure [2] shows the difference due to using two independent sequences of 10"* 
et's [instead of one single sequence] and the severe discrepancy resulting from Scott's 
approximation. (Note that using an artificial MCMC sampler in this case would only 
increase the variability of the approximations.) < 

The approximation may also be rather crude, as shown in the following example, 
inspired from an example posted on Peter Congdon's web-page in connection with [Con^ 



gdon (2007) 



Example 3. Consider comparing 37li : y ^ B{n,p) when p ^ Be{l, 1) with TI2 : y ^ 
B{n,p) when p ^ Beim, m). Once again, the posterior probability can be computed in 
closed form since the Bayes factor is given by 

^ (n+1)! (to + 2/ — l)!(m + n — 2/ — 1)! (to— 1)!^ 
~ y\{n~y)\ (m + n - 1)! (2to - 1)! ' 
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Figure 2: Example [2| Comparison of two approximations of P{M = l\y) with the 
true value (in blue and full lines): Scott's (2002) approximation (in green and mixed 
dashes) and Congdon's (2006) approximation (in brown and long dashes) (based on 
= 10^ simulations). 



The simulations of p^*-* from the posterior Be{y + l,n ~ y + I) in model 9Jli and of 
from the posterior Be(y + m, m + n — y) in model 9Jt2 are straightforward (and obviously 
do not require an extra MCMC step). Figure |3] shows the impact of Congdon's (2006) 
approximation on the evaluation of the posterior probability for n = 15 and m = 100: 
the magnitude is the same but, in that case, the numerical values are quite different. 

In the case of three models in competition, namely when y ^ B{n,p) and the three 
priors are p ~ I3e{l, 1), p ^ Be{a,b) and p ~ Be{c,d), the differences may be of the 
same order, as shown in Figure [4] but the discrepancy is nonetheless decreasing with 
the sample size n. 



At last, the approximation may fall very far from the mark, as demonstrated in the 
following example where the approximation has an asymptotic behaviour opposite to 
the one of the true posterior probability. 

Example 4. Consider comparing 9}ti : y ^ 1/w) with lo ^ £xp{a) against 9II2 : 
exp(?/) ^ £xp{\) with A ~ £xp(b). The corresponding marginals are given in closed 
form by 



27r 72^ (a + yV2)3/2 
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y 

Figure 3: Example [sj Comparison of Congdon's (2006) (in brown and dashed lines) 
approximation of P(M = l\y) with the true value (in blue and full lines) when n = 15 
and m = 510 (based on = 10^ simulations). 



and 

bey 
(b + ev) 



m2{y) = / Xe-'''^be-''^dX 
Jo 



2 



The associated posteriors are uj\y ^ 5a(3/2,a + and X\y ^ Oa{2,b + e^). Figure 

[5] shows the comparison of the true posterior probability of Tli with the approximation 
for various values of (a, 6) and it indicates a very poor fit when y goes to +00. 

It is actually possible to show that the approximation always converges to when 
y goes to +00, while the true posterior probability goes to 1. Indeed, when y goes to 
+00, the Bayes factor is 



e 



2y 



mijy) _ ar(3/2) 

m2{y) ~ bV2^ e!'(yV2)3/2 ' 

which goes to +00 while, since w^*' = et/{a + y"^ /2) and A(*) = vt/{b + eV), with 
et~ 0(3/2,1) and ft ~ 0(2, 1), 

/i(y|cuW)7ri(a>W) ^ a ^Qe'^' ^ + a ^te"' ^/2 

/2(y|A(*))7r2(A(*)) b^ ute-^'* e!'(a + y2/2)i/2 ~ u,e-^t y ' 

which goes to for all (e(,ut). The discrepancy is then extreme. -4 
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Figure 4: Example 1^ Comparison of Congdon's (2006) fin brown and dashed 
lines) approximation of P{M — l\y) with the true value (in blue and full 
lines) when {n,a,b,c,d) is equal to (17,2.5,12.5,501.5,500), (25,1.5,4,540,200), 
(13, .5, 100.5, 20, 10) and (12, .3, 1.8, 200, 200), respectively (based on iV = 10"* simu- 
lations). 
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